Data Science Interview Questions And Answers
Ans. Data Science is the study of data. It involves defining standard methods for recording, extracting, analyzing, and storing data so that useful information can be derived from it.
The ultimate goal of Data Science is to get better insight into data, whether it is in a structured or unstructured format.
The table below provides more details about Data Science, Machine Learning, and Artificial Intelligence.
Ans. A botnet is a network of compromised machines (bots), typically created by infecting them with a Trojan, that an attacker controls remotely, historically over IRC channels.
Ans. Data visualization is a common term that describes any effort to help people understand the significance of data by placing it in a visual context.
Ans. Cleaning data to the point where you can work with it is a huge amount of work. If you are trying to reconcile many sources of data that you don't control, cleaning can take up to 80% of your time.
Ans. Data Modeling – Data modeling in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques. Database Design – Database design is the process of producing a detailed data model of a database. The term can describe many different parts of the design of an overall database system.
Ans. Data is collected from sensors in the environment.
Data is “cleaned” and processed to produce a data set (typically a data table) usable for analysis.
Exploratory data analysis and statistical modelling may be performed.
A data product is a program, such as those retailers use to recommend new purchases based on purchase history. It may also create data and feed it back into the environment.
Ans. Recommender systems are a subclass of information filtering systems meant to predict the preferences or ratings that a user would give to a product. They are widely used for movies, news, research articles, products, social tags, music, etc.
Ans. Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
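A minimal sketch of simple linear regression using scikit-learn; the data here is synthetic and purely illustrative:

```python
# Minimal linear regression sketch: predict Y from a single predictor X.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # predictor variable
y = 3.0 * X.ravel() + 2.0 + rng.normal(0, 1, 100)   # criterion variable with noise

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # slope and intercept, close to 3 and 2
print(model.predict([[5.0]]))          # predicted Y for X = 5
```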
Ans. A hash table (hash map) is a kind of data structure used to implement an associative array, a structure that can map keys to values. Ideally, the hash function assigns each key to a unique bucket, but sometimes two keys generate an identical hash, causing both keys to point to the same bucket; this is known as a hash collision.
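A toy hash table sketch (illustrative, not the built-in dict implementation) that resolves collisions by separate chaining:

```python
# Toy hash table with separate chaining: colliding keys share a bucket,
# and each bucket stores (key, value) pairs so lookups stay correct.
class HashTable:
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)   # two keys may map to the same index

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)        # update an existing key
                return
        bucket.append((key, value))             # collision: append to the chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = HashTable()
table.put("alpha", 1)
table.put("beta", 2)
print(table.get("beta"))  # 2
```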
Ans. SAS is commercial software, whereas R is open source and can be downloaded by anyone. SAS is easy to learn and offers an accessible option for people who already know SQL, whereas R is a programming language, so even simple procedures can require longer code.
Ans. R is a language and environment for statistical computing and graphics. It is a GNU project similar to the S language and environment, which was developed at Bell Laboratories.
Ans. Interpolation is estimating a value that lies between two known values in a list of values. Extrapolation is approximating a value outside the known range by extending a known set of values or facts.
Ans. Collaborative filtering is the process used by most recommender systems to find patterns or information by combining viewpoints, multiple data sources, and multiple agents.
Ans. Cluster sampling is a technique used when it is difficult to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical technique in which elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner, so once you reach the end of the list, you continue from the top again. A classic example of systematic sampling is the equal-probability method, selecting every kth element, as sketched below.
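A small illustrative sketch of systematic sampling, taking every kth element from an ordered frame after a random start:

```python
# Systematic sampling sketch: pick a random start, then take every k-th
# element from the ordered frame (equal-probability method).
import random

population = list(range(1, 101))   # ordered sampling frame of 100 units
k = 10                             # sampling interval
start = random.randrange(k)        # random start within the first interval
sample = population[start::k]      # every k-th element from the start
print(sample)                      # 10 units, evenly spaced through the frame
```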
Ans. They are not different, but the terms are used in different contexts. “Mean” is generally used when talking about a probability distribution or sample population, whereas “expected value” is generally used in the context of a random variable.
Ans. The p-value is used to determine the significance of results after a hypothesis test in statistics. It is always between 0 and 1; a small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, helping readers draw conclusions.
Ans. No, they do not, because in some cases gradient descent reaches a local minimum or local optimum point rather than the global optimum. Whether it does depends on the data and the starting conditions.
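An illustrative sketch: gradient descent on a simple non-convex function, where the starting point determines which minimum is reached (the function and learning rate are arbitrary choices):

```python
# Gradient descent on the non-convex function f(x) = x**4 - 3*x**2 + x.
# Different starting points converge to different local minima.
def grad(x):
    return 4 * x**3 - 6 * x + 1   # derivative of f

def descend(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(descend(x=2.0))   # converges to the local minimum on the right
print(descend(x=-2.0))  # converges to a different (here, the global) minimum
```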
Ans. A/B testing is a form of testing in which two variants are compared and a result is derived. A/B testing is valuable because its outcome points to concrete improvements for the system. For example, A/B testing a web page reveals the current state of the page, and the test results provide feedback and suggest enhancements to the page layout.
Ans. Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
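A short sketch using NumPy to compute the eigenvalues and eigenvectors of a covariance matrix (the data is synthetic):

```python
# Eigen decomposition of a covariance matrix: eigenvectors give the
# principal directions, eigenvalues give the variance (strength) along them.
import numpy as np

rng = np.random.default_rng(0)
data = rng.multivariate_normal([0, 0], [[3, 1], [1, 2]], size=500)

cov = np.cov(data, rowvar=False)                 # 2x2 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh is for symmetric matrices

print(eigenvalues)    # variance captured along each eigenvector
print(eigenvectors)   # columns are the directions of the transformation
```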
Ans. Outlier values can be identified using univariate or other graphical analysis methods. If there are only a few outliers, they can be assessed individually; for a large number of outliers, the values can be substituted with the 99th or 1st percentile values. Not all extreme values are outliers. The most common ways to treat outlier values are to remove them, cap (winsorize) them at percentile thresholds, or transform the variable (for example with a log transform).
Ans. There are various methods to assess the results of a logistic regression analysis, such as the classification (confusion) matrix, the AUC-ROC curve, concordance, and goodness-of-fit measures like the likelihood ratio test or AIC.
Ans. The extent of the missing values is assessed after identifying the variables that contain them. If patterns are identified, the analyst should concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation) or simply ignored. Factors to consider include the extent of the missing data, whether the values are missing at random, and how imputation would affect the analysis.
Ans. The term “machine learning” is widely used in the data analysis world. Machine learning is an application of Artificial Intelligence in which algorithms learn from data automatically, without being explicitly programmed. As the algorithms execute, the data is parsed, and patterns are detected and used to make predictions.
The following are common uses of Machine Learning: fraud detection, recommendation engines, spam filtering, image and speech recognition, and demand forecasting.
Ans. A validation set can be considered part of the training data, as it is used for parameter selection and to avoid overfitting of the model being built. The test set, on the other hand, is used for testing or evaluating the performance of the trained machine learning model.
In simple terms, the differences can be summarized as: the validation set is used during training to tune parameters, while the test set is used only once, for the final evaluation of the trained model.
Ans. The best possible answer would be Python, because its Pandas library provides easy-to-use data structures and high-performance data analysis tools.
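A small illustrative taste of Pandas (the column names and values are made up):

```python
# A quick look at Pandas: build a small table and summarize it in a few lines.
import pandas as pd

df = pd.DataFrame({
    "territory": ["North", "South", "North", "East"],
    "sales": [120, 95, 130, 80],
})

print(df.describe())                           # quick numeric summary
print(df.groupby("territory")["sales"].sum())  # total sales per territory
```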
Ans. Logistic regression is a technique in which a model predicts a binary outcome (0 or 1) from a linear combination of one or more predictor variables.
Logistic regression is explained with an example:
The output of the logistic regression is whether the politician will win or not, i.e. the output is derived in the form of binary values (1 or 0): win or lose.
To derive the output, the inputs could be, for example, the amount of money spent on the campaign and the amount of time spent campaigning.
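A hedged sketch of this example with scikit-learn; the two predictors (money spent, time spent) and the synthetic data are illustrative assumptions, not real campaign data:

```python
# Logistic regression sketch: predict win/lose (1/0) from two illustrative predictors.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
money = rng.uniform(0, 100, 200)        # hypothetical campaign spend
time_spent = rng.uniform(0, 50, 200)    # hypothetical time spent campaigning
X = np.column_stack([money, time_spent])
y = (0.03 * money + 0.05 * time_spent + rng.normal(0, 0.5, 200) > 2.5).astype(int)

model = LogisticRegression().fit(X, y)
print(model.predict([[80, 40]]))        # predicted class: 1 (win) or 0 (lose)
print(model.predict_proba([[80, 40]]))  # probabilities for lose vs win
```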
Ans. These are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.
If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, it is referred to as bivariate analysis. For example, analyzing sales volume against marketing spend is an example of bivariate analysis.
Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.
Ans. Part of a data scientist's role in certain companies involves working closely with the product teams to help define, measure, and report on these metrics. This is an exercise you can go through by yourself at home, and can really help during your interview process.
Ans. The following are a few common data quality issues:
Values outside constraint ranges
A mixture of different languages
Noise in the Data set
Missing values / Null Values
Outliers in the Data set.
Ans. Univariate analysis is an analysis in which data is represented to business users with only a single variable highlighted at a time. For example, a pie chart of sales figures by country involves one variable. This form of study and analysis is called univariate analysis.
Ans. As the name suggests, bivariate analysis is a process in which two variables are considered for data analysis and representation. For example, understanding the amount of money spent on marketing versus the number of sales triggered. This study or analysis is called bivariate analysis.
Ans. As the name suggests, multivariate analysis is a process in which more than two variables are considered for analysis purposes.
Ans. The normal distribution is one of the most important statistical distributions; its data points are symmetrically distributed around the mean, producing the familiar bell curve.
Ans. A power analysis is a standard procedure widely used to estimate the minimum sample size required to conduct an experiment.
Ans. Supervised learning is a technique in which a human trains the machine with a “labeled” data set; in other words, the correct answer is associated with each example. This form of training helps the model predict outcomes.
To build a robust model, the trainer or supervisor has to invest a substantial amount of time. Also, if the data insights change, the data model has to be changed accordingly.
On the other hand, unsupervised learning is a process that needs no external human interference. The model is left on its own to understand the data and discover information on the fly. The data used here is an unlabelled data set. Unsupervised algorithms are capable of handling more complex tasks than supervised learning algorithms.
Ans. For a binary classifier, a confusion matrix is a 2×2 table in which the output is given as counts. The summary of results contains both correct and incorrect predictions, so one can see which types of errors the classifier makes.
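A short sketch computing a confusion matrix with scikit-learn (the labels are made-up examples):

```python
# Confusion matrix sketch for a binary classifier: rows are actual classes,
# columns are predicted classes, cells are counts.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))
# [[3 1]    -> TN, FP
#  [1 3]]   -> FN, TP
```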
Ans. Correlation and covariance are two statistical measures that describe the relationship between two variables. Both are commonly used in statistics and probability scenarios.
Covariance: measures the direction of the linear relationship between two variables; its magnitude depends on the units of the variables.
Correlation: measures both the strength and direction of the linear relationship, normalized to lie between -1 and 1, so it is unit-free.
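A quick illustrative comparison in NumPy (arbitrary sample values):

```python
# Covariance vs correlation: covariance is scale-dependent,
# correlation normalizes it to the range [-1, 1].
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])

print(np.cov(x, y)[0, 1])       # covariance of x and y
print(np.corrcoef(x, y)[0, 1])  # correlation: same sign, but unit-free
```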
Ans. Re-sampling is a process in which repeated samples are drawn from an original data set. Re-sampling methods tend to be computational rather than analytical, and they help in estimating a sampling distribution.
Usually, re-sampling can be executed in the following ways: bootstrapping (sampling with replacement), jackknife (leave-one-out) estimation, cross-validation, and permutation tests. A bootstrap sketch follows.
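An illustrative bootstrap sketch in NumPy, resampling with replacement to estimate the sampling distribution of the mean:

```python
# Bootstrap resampling: repeatedly resample with replacement from the
# original sample to estimate the sampling distribution of the mean.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(50, 10, size=100)        # original data set

boot_means = [
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(1000)
]
print(np.mean(boot_means), np.std(boot_means))  # bootstrap mean and its spread
```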
Ans. In machine learning and statistical analysis, one of the vital tasks is to fit a model to a training data set so that the model can make predictions. Two closely related concepts matter here: overfitting and underfitting.
Overfitting:
As the word suggests, an overfitted model is too complex: it has too many parameters relative to the number of observations. Overfitted models usually show poor predictive performance because they react to the slightest fluctuations in the training data.
Underfitting:
This is the opposite case, where the model cannot capture the underlying trend of the data, for example, fitting a linear model to a non-linear data set. Because the model form does not match the data, it cannot produce strong predictions and performs poorly.
Ans. In reality, both languages are open source, widely adopted, and have large user bases, especially for data analysis.
Python: a general-purpose language with powerful data analysis libraries such as Pandas, NumPy, and scikit-learn. It integrates easily into larger applications and production systems.
R: a language designed specifically for statistical computing, with a rich ecosystem of statistical packages and visualization tools such as ggplot2. It is a popular choice for exploratory analysis and research.
Ans. In computing, a star schema is the simplest style of data mart schema and is widely used to build data warehouses and dimensional data marts. A star schema has one or more fact tables referencing any number of dimension tables.
The name comes from its appearance: the fact table sits at the centre, and the dimension tables surround it like the points of a star.
Ans. Data sampling is a statistical analysis technique used to select, manipulate, and analyze a subset of data in order to identify patterns and trends in the larger data set.
There are different sampling methods in place; the following techniques are commonly used to analyze data sets: simple random sampling, systematic sampling, stratified sampling, and cluster sampling.
Ans. A validation set is a portion of the training data used for parameter selection. It also helps ensure that the model is not overfitted.
Ans. As the name indicates, it provides a pictorial form of the decision process. A decision tree is a supervised machine learning algorithm mainly used for classification and regression.
In this process, the data set is repeatedly split into smaller subsets while the associated decision tree is built incrementally, showing the relation between steps. The result is a flowchart-like structure with decision nodes and leaf nodes.
A decision tree is well capable of handling categorical data sets and numerical data sets.
Pruning is an effective machine learning technique used primarily to reduce the size of a decision tree. It reduces the complexity of the classifier, which in turn improves predictive accuracy.
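A hedged sketch with scikit-learn comparing an unpruned tree against one pruned via a depth limit and cost-complexity pruning (the Iris data set and parameter values are illustrative choices):

```python
# Decision tree sketch with pruning: limiting depth (pre-pruning) and
# cost-complexity pruning (ccp_alpha) both shrink the tree to fight overfitting.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
pruned_tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01,
                                     random_state=0).fit(X, y)

print(full_tree.get_depth(), pruned_tree.get_depth())  # pruned tree is shallower
```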
Ans. The term “Boosting” refers to a family of algorithms used primarily to enhance weak learners and turn them into strong learners. Using this concept, the algorithms are improved so that the results are better than in the initial stages.
Boosting is a method in which weak learners are tweaked and enhanced in sequential order. Because the process is sequential, each new learner focuses on the errors of its predecessors, so the ensemble grows stronger at each step.
There are three widely used types of boosting: AdaBoost (adaptive boosting), Gradient Boosting, and XGBoost (extreme gradient boosting).
Ans. The following are cases where the algorithm should be updated: when the underlying data distribution changes over time (data drift), when the source of the data changes, and when the model's results no longer reach the required accuracy.
Ans. Deep learning is a distinct branch of machine learning. In deep learning, the algorithms have a structure inspired by the function of the brain; these structures are called artificial neural networks.
Ans. Reinforcement learning is a technique oriented toward maximizing a reward signal. It concerns which actions an agent should take and how those actions map to outcomes, so that the reward signal is enhanced.
In this process, the learner is not told which action to take but must discover which actions yield the best result. The process mirrors human learning, where great importance is given to a reward/penalty mechanism.
Ans. A hyperparameter is a predefined parameter whose value is set before the learning process is executed. Hyperparameters govern how a network is trained and define its structure.
For example: the learning rate, the number of hidden layers, the batch size, and the number of training epochs.
Ans. CNN stands for Convolutional Neural Network. Within this network, four different types of layers are used: the convolutional layer, the ReLU (activation) layer, the pooling layer, and the fully connected layer.
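A minimal sketch of these four layers using the Keras API; the input shape and layer sizes are illustrative assumptions:

```python
# Minimal Keras sketch showing the four classic CNN layer types in order:
# convolution, ReLU activation, pooling, and a fully connected output.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),         # e.g. grayscale images
    tf.keras.layers.Conv2D(32, (3, 3)),               # convolutional layer
    tf.keras.layers.ReLU(),                           # ReLU (activation) layer
    tf.keras.layers.MaxPooling2D((2, 2)),             # pooling layer
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),  # fully connected layer
])
model.summary()
```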
Ans. There are three different variants of gradient descent used in backpropagation, listed below:
Stochastic Gradient Descent: only one training example is used to calculate the gradient, and the parameters are updated accordingly after each example.
Batch Gradient Descent: the gradient is calculated over the entire dataset, and an update is performed on every iteration.
Mini-batch Gradient Descent: considered one of the best optimization algorithms, it works like stochastic gradient descent but, instead of a single training example, it uses mini-batches (see the sketch below).
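An illustrative mini-batch gradient descent sketch for simple linear regression (synthetic data; the learning rate and batch size are arbitrary choices):

```python
# Mini-batch gradient descent: each update uses a small random batch
# instead of a single example or the full data set.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (1000, 1))
y = 4.0 * X.ravel() + 1.0 + rng.normal(0, 0.1, 1000)

w, b, lr, batch = 0.0, 0.0, 0.1, 32
for _ in range(500):
    idx = rng.choice(len(X), size=batch, replace=False)  # sample a mini-batch
    xb, yb = X[idx].ravel(), y[idx]
    err = w * xb + b - yb
    w -= lr * 2 * np.mean(err * xb)   # gradient of MSE w.r.t. w
    b -= lr * 2 * np.mean(err)        # gradient of MSE w.r.t. b

print(w, b)   # approaches the true slope 4 and intercept 1
```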
Ans. Different deep learning frameworks are listed below: TensorFlow, Keras, PyTorch, Caffe, MXNet, and Chainer.
Ans. The following skills are vital for excelling in data analysis: statistics and probability, programming (Python or R, plus SQL), data wrangling and cleaning, data visualization, machine learning fundamentals, and communication skills.
Ans. A uniform distribution is a case where the data is spread equally across the whole range.
A skewed distribution is a case where the data is concentrated on one side of the plot; a distribution can be either left-skewed or right-skewed.
Ans. Precision is the percentage of positive predictions that are actually correct: TP / (TP + FP).
Recall is the percentage of actual positives that the model correctly identifies: TP / (TP + FN).
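A short sketch computing both metrics with scikit-learn on made-up labels:

```python
# Precision and recall: computed from the same predictions,
# they answer different questions about the positive class.
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # 0.75: of 4 predicted positives, 3 are correct
print(recall_score(y_true, y_pred))     # 0.75: of 4 actual positives, 3 are found
```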
Ans. The data is collected from various social media channels such as Twitter and Facebook, using their APIs. For example, the following can be collected from a single tweet: the tweeted date, the number of retweets, the source, the content of the tweet, the number of favourites, etc.
Using all of this information, a multivariate time series model can be built to predict the answer.
Ans. It is advisable to run the features through a Gradient Boosting Machine or Random Forest and generate plots of relative variable importance. One can also look at the variables that were added during the forward variable selection process.
Ans. Under Naive Bayes' assumptions, all predictor variables are equally important and independent of each other. In reality, this assumption rarely holds fully, yet the classifier still works well for many classification problems.
Ans. With the values assigned as n = 12 trials and k = 3 distinct classes, the outcome follows a multinomial distribution across the classes.
Ans. Having more data can actually cause more issues if it is not managed well; a few of these are listed below for reference: higher storage and processing costs, slower analysis and model training, more noise and redundant records, and a heavier data quality and governance burden.
Ans. High dimensionality makes clustering hard: the cluster has to accommodate many dimensions, and the data becomes sparse.
For example: to cover even a small fraction of the data volume, the model has to span a wide range of every variable.
Ans. A confidence interval is a range of values, constructed from a sample, that is likely to contain the true population parameter. For example, with a 95% confidence level, if we repeatedly drew samples and constructed an interval from each, about 95% of those intervals would contain the true mean.
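An illustrative sketch computing a 95% confidence interval for a mean via the normal approximation (synthetic data):

```python
# 95% confidence interval for the mean using the normal approximation.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(100, 15, size=200)

mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(len(sample))  # standard error of the mean
low, high = mean - 1.96 * sem, mean + 1.96 * sem
print(f"95% CI for the mean: ({low:.2f}, {high:.2f})")
```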
Ans. Correlation is a measure that helps us understand the relationship between two or more variables.
Causation describes a causal relationship between two events: it represents cause and effect.
Causation implies correlation, but correlation does not imply causation.